Abstract



The 'star paradox' in phylogenetics is the tendency for a particular 
resolved tree to be sometimes strongly supported even when the data 
is generated by an unresolved ('star') tree. There have been contrary 
claims as to whether this phenomenon persists when very long sequences 
are considered. This note settles one aspect of this debate by proving 
mathematically that there is always a chance that a resolved tree could 
be strongly supported, even as the length of the sequences becomes very 
large. 



1 Introduction 



Two recent papers (Yang and Rannala 2005; Lewis, Holder and Holsinger 2005)
highlighted a phenomenon that occurs when sequences evolve on a tree that
contains a polytomy - in particular a three-taxon unresolved rooted tree. As
longer sequences are analysed using a Bayesian approach, the posterior
probabilities of the trees that give the different resolutions of the polytomy
do not converge on relatively equal probabilities - rather, a given resolution
can sometimes have a posterior probability close to one. In response,
Kolaczkowski and Thornton (2006) investigated this phenomenon further,
providing some interesting simulation results and offering an argument that
seems to suggest that the tendency to sometimes infer strongly supported
resolutions, reported in the earlier papers, would disappear with sufficiently
long sequences. As part of their case the authors use the expected site
frequency patterns to simulate the case of infinite length sequences,
concluding that "with infinite length data, posterior probabilities give equal
support for all resolved trees, and the rate of false inferences falls to
zero." Of course these findings concern sequences that are effectively
infinite, and, as is well known in statistics, the limit of a function of
random variables (in this case site pattern frequencies for the first n sites)
does not necessarily equal the function of the limit of the random variables.
Accordingly, Kolaczkowski and Thornton offer this appropriate cautionary
qualification of their findings:

"Analysis of ideal data sets does not indicate what will happen when very
large data sets with some stochastic error are analyzed, but it does show that
when infinite data are generated on a star tree, posterior probabilities are
predictable, equally supporting each possible resolved tree."

Yang and Rannala (2005) had attempted to simulate the large sample posterior
distribution, but ran into numerical problems and commented that it was
"unclear" what the limiting distribution on posterior probabilities was as n
became large.

In particular, all of the aforementioned papers have left open an interesting 
statistical question, which this short note formally answers - namely, does the 
Bayesian posterior probability of the three resolutions of a star tree on three 
taxa converge to 1/3 as the sequence length tends to infinity? That is, does the 
distribution on posterior probabilities for 'very long sequences' converge on the 
distribution for infinite length sequences? We show that for most reasonable 
priors it does not. Thus the 'star paradox' does not disappear as the sequences 
get longer. 

As noted by Yang and Rannala (2005) and Lewis, Holder and Holsinger (2005), one
can demonstrate such phenomena more easily for related simpler processes such
as coin tossing (particularly if one imposes a particular prior). Here we avoid
this simplification, to avoid the criticism that such results do not rigorously
establish corresponding phenomena in the phylogenetic setting, which in
contrast to coin tossing involves a parameter space of dimension greater than
1. We also frame our main result so that it applies to a fairly general class
of priors.

Note also that it is not the purpose of this short note to add to the on-going
debate concerning the implications of this 'paradox' for Bayesian phylogenetic
analysis; we merely demonstrate its existence. Some further comments and
earlier references on the phenomenon can be found in the recent review paper
by Alfaro and Holder (2006, pp. 35-36).

Figure 1: The three resolved rooted phylogenetic trees on three taxa
$T_1, T_2, T_3$, and the unresolved 'star' tree $T_0$ on which the sequences
are generated.

2 Analysis of the star tree paradox for three 
taxa 

On tree $T_1$ (in Fig. 1) let $p_i = p_i(t_0, t_1)$, $i = 0, 1, 2, 3$, denote
the probabilities of the four site patterns (xxx, xxy, yxx, xyx, respectively)
under the simple 2-state symmetric Markov process (the argument extends to more
general models, but it suffices to demonstrate the phenomenon for this simple
model). From Eqn. (2) of Yang and Rannala (2005) we have



$$p_0(t_0, t_1) = \frac{1}{4}\left(1 + e^{-4t_1} + 2e^{-4(t_0+t_1)}\right),$$

$$p_1(t_0, t_1) = \frac{1}{4}\left(1 + e^{-4t_1} - 2e^{-4(t_0+t_1)}\right),$$

and

$$p_2(t_0, t_1) = p_3(t_0, t_1) = \frac{1}{4}\left(1 - e^{-4t_1}\right).$$

It follows by elementary algebra that for $i = 2, 3$

$$\frac{p_1(t_0, t_1)}{p_i(t_0, t_1)} \geq 1 + 2e^{-4t_1}\left(1 - e^{-4t_0}\right), \tag{1}$$

and thus $p_1(t_0, t_1) \geq p_i(t_0, t_1)$, with strict inequality unless
$t_0 = 0$ (or in the limit as $t_1$ tends to infinity).
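The site pattern formulas and inequality (1) can be checked numerically. The following is a minimal sketch; the function names `site_pattern_probs` and `ratio_lower_bound` are introduced here for illustration and do not come from the papers cited.

```python
import math

def site_pattern_probs(t0, t1):
    """Return (p0, p1, p2, p3) on tree T1 under the 2-state symmetric model."""
    a = math.exp(-4.0 * t1)          # e^{-4 t1}
    b = math.exp(-4.0 * (t0 + t1))   # e^{-4 (t0 + t1)}
    p0 = 0.25 * (1.0 + a + 2.0 * b)
    p1 = 0.25 * (1.0 + a - 2.0 * b)
    p2 = 0.25 * (1.0 - a)
    return p0, p1, p2, p2            # p3 equals p2

def ratio_lower_bound(t0, t1):
    """Right-hand side of inequality (1)."""
    return 1.0 + 2.0 * math.exp(-4.0 * t1) * (1.0 - math.exp(-4.0 * t0))

# Illustrative branch lengths (arbitrary choices, not taken from the source).
for t0, t1 in [(0.0, 0.1), (0.05, 0.1), (0.5, 0.5), (1.0, 2.0)]:
    p0, p1, p2, p3 = site_pattern_probs(t0, t1)
    assert abs(p0 + p1 + p2 + p3 - 1.0) < 1e-12        # probabilities sum to 1
    assert p1 / p2 >= ratio_lower_bound(t0, t1) - 1e-12  # inequality (1)
```

Setting `t0 = 0` gives the star tree, for which `p1`, `p2` and `p3` coincide.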

To allow maximal generality we make only minimal neutral assumptions about the
prior distribution on trees and branch lengths. Namely, we assume that the
three resolved trees on three leaves (trees $T_1, T_2, T_3$ in Fig. 1) have
equal prior probability $\frac{1}{3}$, and that the prior distribution on
branch lengths $t_0, t_1$ is the same for each tree and has a continuous joint
probability density function that is everywhere non-zero. This condition
applies, for example, to the exponential and gamma priors discussed by Yang and
Rannala (2005). We call any prior that satisfies these conditions reasonable.
Note that we do not require that $t_0$ and $t_1$ be independent.

Let $\mathbf{n} = (n_0, n_1, n_2, n_3)$ be the counts of the different types of
site patterns (corresponding to the same patterns as for the $p_i$'s). Thus
$n = \sum_{i=0}^{3} n_i$ is the total number of sites (i.e. the length of the
sequences). Given a prior distribution on $(t_0, t_1)$ for the branch lengths
of $T_i$ (for $i = 1, 2, 3$) let $\mathbb{P}[T_i|\mathbf{n}]$ be the posterior
probability of tree $T_i$ given the site pattern counts $\mathbf{n}$. Now
suppose the $n$ sites are generated on a star tree $T_0$ with positive branch
lengths. We are interested in whether the posterior probability
$\mathbb{P}[T_i|\mathbf{n}]$ could be close to 1, or whether the chance of
generating data with this property goes to zero as the sequence length gets
very large. Our main result, stated next, rules out the latter possibility:

Theorem 2.1 Consider sequences of length $n$ generated by a star tree $T_0$ on
three taxa with strictly positive edge length, and let $\mathbf{n}$ be the
resulting data (in terms of site pattern counts). Consider any prior on the
three resolved trees ($T_1, T_2, T_3$) and their branch lengths that is
reasonable (as defined above). For any $\epsilon > 0$, and each resolved tree
$T_i$ ($i = 1, 2, 3$), the probability that $\mathbf{n}$ has the property that

$$\mathbb{P}(T_i|\mathbf{n}) > 1 - \epsilon$$

does not converge to 0 as $n$ tends to infinity.

Proof of Theorem 2.1. Consider the star tree $T_0$ with given branch lengths
$t_1^0$ (as in Fig. 1). Let $(q_0, q_1, q_2, q_3)$ denote the probabilities of
the four types of site patterns generated by $T_0$ with these branch lengths.
Note that $q_1 = q_2 = q_3$ (and so $q_0 = 1 - 3q_1$). Suppose we generate $n$
sites on this tree, and let $n_0, n_1, n_2, n_3$ be the counts of the different
types of site patterns (corresponding to the $p_i$'s). Let

$$\Delta_0 := \frac{n_0 - q_0 n}{\sqrt{n}}$$

and for $i = 1, 2, 3$ let

$$\Delta_i := \frac{n_i - \frac{1}{3}(n - n_0)}{\sqrt{n}}.$$

For a constant $c > 1$, let $F_c$ denote the event:

$$F_c : \Delta_2, \Delta_3 \in [-2c, -c] \text{ and } \Delta_0 \in [-c, c].$$

Notice that $F_c$ implies $\Delta_1 \in [2c, 4c]$ since
$\Delta_1 + \Delta_2 + \Delta_3 = 0$. By standard stochastic arguments (based
on the asymptotic approximation of the multinomial distribution by the
multinormal distribution) event $F_c$ has probability at least some value
$\delta' = \delta'(c) > 0$ for all $n$ sufficiently large (relative to $c$).
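The normalized deviations and the event $F_c$ translate directly into code. The sketch below is illustrative only: the helper names `deltas` and `in_F_c` and the example counts are assumptions made here, chosen so that the arithmetic is easy to follow by hand.

```python
import math

def deltas(counts, q0):
    """Return (Delta_0, Delta_1, Delta_2, Delta_3) for site pattern counts."""
    n0, n1, n2, n3 = counts
    n = n0 + n1 + n2 + n3
    root_n = math.sqrt(n)
    d0 = (n0 - q0 * n) / root_n
    third = (n - n0) / 3.0            # expected share of each variable pattern
    return (d0,) + tuple((ni - third) / root_n for ni in (n1, n2, n3))

def in_F_c(counts, q0, c):
    """True if the counts satisfy the event F_c for the constant c."""
    d0, _, d2, d3 = deltas(counts, q0)
    return (-c <= d0 <= c) and (-2 * c <= d2 <= -c) and (-2 * c <= d3 <= -c)

# Hypothetical data: n = 10000 sites with q0 = 0.4 (so q1 = 0.2) and c = 1.5.
example = (4000, 2400, 1800, 1800)
d0, d1, d2, d3 = deltas(example, q0=0.4)
assert abs(d1 + d2 + d3) < 1e-9       # the three deviations sum to zero
assert in_F_c(example, q0=0.4, c=1.5)
```

In this example the deviations are approximately (0, 4, -2, -2), so $\Delta_1$ lies in $[2c, 4c] = [3, 6]$ as the text notes.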

Given the data $\mathbf{n} = (n_0, n_1, n_2, n_3)$, the assumption of equality
of priors across $T_1$, $T_2$ and $T_3$ implies that

$$\mathbb{P}(n_0, n_1, n_2, n_3 | T_2, t_0, t_1) = \mathbb{P}(n_0, n_2, n_3, n_1 | T_1, t_0, t_1), \tag{2}$$

and

$$\mathbb{P}(n_0, n_1, n_2, n_3 | T_3, t_0, t_1) = \mathbb{P}(n_0, n_3, n_1, n_2 | T_1, t_0, t_1). \tag{3}$$

Now, as $(t_0, t_1)$ are random variables with some prior density, when we view
$p_0, p_1, p_2, p_3$ as random variables by virtue of their dependence on
$(t_0, t_1)$, we will write them as $P_0, P_1, P_2, P_3$ (note that Yang and
Rannala (2005) use $P_i$ differently). With this notation, the posterior
probability of $T_1$ conditional on $\mathbf{n}$ can be written as

$$\mathbb{P}(T_1|\mathbf{n}) = p(\mathbf{n})^{-1} \times E_1[P_0^{n_0} P_1^{n_1} P_2^{n_2} P_3^{n_3}],$$

where $p(\mathbf{n})$ is a normalizing factor that depends only on
$\mathbf{n}$, and $E_1$ denotes expectation with respect to the prior for
$t_0, t_1$ on $T_1$. Moreover, since $P_2 = P_3$, we can write this as
$\mathbb{P}(T_1|\mathbf{n}) = p(\mathbf{n})^{-1} \times E_1[P_0^{n_0} P_1^{n_1} P_2^{n_2+n_3}]$.
By (2) and (3) we have

$$\mathbb{P}(T_2|\mathbf{n}) = p(\mathbf{n})^{-1} \times E_1[P_0^{n_0} P_1^{n_2} P_2^{n_1+n_3}] \quad\text{and}\quad \mathbb{P}(T_3|\mathbf{n}) = p(\mathbf{n})^{-1} \times E_1[P_0^{n_0} P_1^{n_3} P_2^{n_1+n_2}],$$

where again expectation is taken with respect to the prior for $t_0, t_1$ on
$T_1$. Consequently,

$$\frac{\mathbb{P}(T_1|\mathbf{n})}{\mathbb{P}(T_2|\mathbf{n})} = \frac{E_1[X]}{E_1[Y]}, \tag{4}$$

where

$$X = P_0^{n_0} P_1^{n_1} P_2^{n_2+n_3} \quad\text{and}\quad Y = P_0^{n_0} P_1^{n_2} P_2^{n_1+n_3}.$$
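The three posterior probabilities are proportional to prior expectations of products of powers of $P_0, P_1, P_2$, so they can be approximated by numerical integration. The sketch below is one way to do this and is not the method of any of the cited papers. It assumes independent exponential priors on $t_0$ and $t_1$ with a common mean (`prior_mean`), truncates the integral at `t_max`, and uses a midpoint rule on a `grid` by `grid` mesh; all of these names and default values are assumptions made for illustration. Logarithms are used to avoid numerical underflow.

```python
import math

def log_patterns(t0, t1):
    """Return (log p0, log p1, log p2) on tree T1; requires t1 > 0."""
    a = math.exp(-4.0 * t1)
    b = math.exp(-4.0 * (t0 + t1))
    return (math.log(0.25 * (1.0 + a + 2.0 * b)),
            math.log(0.25 * (1.0 + a - 2.0 * b)),
            math.log(0.25 * (1.0 - a)))

def posterior_probs(counts, prior_mean=0.1, t_max=2.0, grid=200):
    """Approximate [P(T1|n), P(T2|n), P(T3|n)] by a midpoint rule."""
    n0, n1, n2, n3 = counts
    h = t_max / grid
    logs = [[], [], []]               # log integrand values, one list per tree
    for i in range(grid):
        t0 = (i + 0.5) * h
        for j in range(grid):
            t1 = (j + 0.5) * h
            l0, l1, l2 = log_patterns(t0, t1)
            # Exponential prior density, up to a constant that cancels.
            base = n0 * l0 - (t0 + t1) / prior_mean
            logs[0].append(base + n1 * l1 + (n2 + n3) * l2)  # X
            logs[1].append(base + n2 * l1 + (n1 + n3) * l2)  # Y
            logs[2].append(base + n3 * l1 + (n1 + n2) * l2)
    shift = max(max(v) for v in logs)  # common shift keeps exp() in range
    weights = [sum(math.exp(x - shift) for x in v) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]
```

The factor $p(\mathbf{n})$ and the mesh area cancel in the normalization. For long sequences the integrand becomes sharply peaked, so the mesh would need to be refined (larger `grid`) for the approximation to remain reliable.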

As will be shown later, it suffices to demonstrate that the ratio in (4) can be
large with nonvanishing probability in order to obtain the conclusion of the
theorem. In order to do so we use the following lemma, whose proof is provided
in the Appendix.

Lemma 2.2 Let $X, Y$ be non-negative continuous random variables, dependent
on a third random variable $\Lambda$ that takes values in an interval
$I = [a, b]$. Suppose that for some interval $I_0$ strictly inside $I$, and
$I_1 = I - I_0$, the following inequality holds:

$$E[Y | \Lambda \in I_0] \geq E[Y | \Lambda \in I_1], \tag{5}$$

and that for some constant $B > 0$, and all $\lambda \in I_0$,

$$\frac{E[X | \Lambda = \lambda]}{E[Y | \Lambda = \lambda]} \geq B. \tag{6}$$

Then $\frac{E[X]}{E[Y]} \geq B \cdot \mathbb{P}(\Lambda \in I_0)$.



To apply this lemma, select a value $s > 0$ so that
$\frac{1}{4} + s < q_0 < 1 - s$, and let $I_0 = [q_0 - s, q_0 + s]$. Then let
$I = [\frac{1}{4}, 1]$ and $I_1 = I - I_0$.

Claim: For $n$ sufficiently large, and conditional on the data
$\mathbf{n} = (n_0, n_1, n_2, n_3)$ satisfying $F_c$:

(i) $E_1[Y | P_0 \in I_0] \geq E_1[Y | P_0 \in I_1]$.

(ii) For all $p_0 \in I_0$, $\frac{E_1[X | P_0 = p_0]}{E_1[Y | P_0 = p_0]} \geq 6c^2$.

The proofs of these two claims are given in the Appendix.

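Applying Lemma 2.2 to Claims (i) and (ii) we deduce that, conditional on
$\mathbf{n}$ satisfying $F_c$ and $n$ being sufficiently large,

$$\frac{E_1[X]}{E_1[Y]} \geq 6c^2 \cdot \mathbb{P}(P_0 \in I_0). \tag{7}$$

Select $c > 1$ large enough that
$c > \left(3\epsilon\,\mathbb{P}(P_0 \in I_0)\right)^{-1/2}$ (this is finite by
the assumption that the prior on $(t_0, t_1)$ is everywhere non-zero). As
stated before, the probability that $\mathbf{n}$ satisfies $F_c$ is at least
$\delta' = \delta'(c) > 0$ for $n$ sufficiently large. Then
$6c^2 \cdot \mathbb{P}(P_0 \in I_0) > \frac{2}{\epsilon}$ and so by (7),
$\frac{\mathbb{P}(T_1|\mathbf{n})}{\mathbb{P}(T_2|\mathbf{n})} = \frac{E_1[X]}{E_1[Y]} > \frac{2}{\epsilon}$.
Similarly,
$\frac{\mathbb{P}(T_1|\mathbf{n})}{\mathbb{P}(T_3|\mathbf{n})} > \frac{2}{\epsilon}$.
Now, since
$\mathbb{P}(T_1|\mathbf{n}) + \mathbb{P}(T_2|\mathbf{n}) + \mathbb{P}(T_3|\mathbf{n}) = 1$,
it follows that, for $n$ sufficiently large, and conditional on an event of
probability at least $\delta' > 0$, we have
$\mathbb{P}(T_1|\mathbf{n}) > 1 - \epsilon$ as claimed. The same argument
applies to $T_2$ and $T_3$ by symmetry. This completes the proof. □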
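The final step of the proof above can be spelled out as follows. This is a sketch under the assumption that both posterior ratios exceed a bound $K \geq 2/\epsilon$; the symbol $K$ is introduced here only for illustration.

```latex
\begin{align*}
\mathbb{P}(T_2 \mid \mathbf{n})
  &< \frac{1}{K}\,\mathbb{P}(T_1 \mid \mathbf{n}) \le \frac{1}{K} \le \frac{\epsilon}{2},
  \qquad
  \mathbb{P}(T_3 \mid \mathbf{n}) < \frac{\epsilon}{2}
  \quad\text{(by the same reasoning)}, \\
\mathbb{P}(T_1 \mid \mathbf{n})
  &= 1 - \mathbb{P}(T_2 \mid \mathbf{n}) - \mathbb{P}(T_3 \mid \mathbf{n})
  > 1 - \epsilon .
\end{align*}
```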
2.1 Concluding remarks

One feature of the argument we have provided is that it does not require
stipulating in advance a particular prior on the branch lengths - that is, the
result is somewhat generic as it imposes relatively few conditions. Moreover,
the requirement that the prior on $(t_0, t_1)$ be everywhere non-zero could be
weakened to simply being non-zero in a neighborhood of $(0, t_1^0)$ (thereby
allowing, for example, a uniform distribution on a bounded range).

An interesting open question in the spirit of this paper is to explicitly
calculate the limit of the posterior density $f(P_1, P_2, P_3)$ described in
Yang and Rannala (2005).

3 Appendix: Proof of Lemma 2.2 and Claims (i), (ii)



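Proof of Lemma 2.2. For $W = X, Y$ we have

$$E[W] = E[W | \Lambda \in I_0]\,\mathbb{P}(\Lambda \in I_0) + E[W | \Lambda \in I_1]\,\mathbb{P}(\Lambda \in I_1). \tag{8}$$

In particular, for $W = X$ we have:
$E[X] \geq E[X | \Lambda \in I_0]\,\mathbb{P}(\Lambda \in I_0)$. Note that (6)
implies that $E[X | \Lambda \in I_0] \geq B \cdot E[Y | \Lambda \in I_0]$, so

$$E[X] \geq B \cdot E[Y | \Lambda \in I_0]\,\mathbb{P}(\Lambda \in I_0). \tag{9}$$

Taking $W = Y$ in (8) and applying (5) gives us

$$E[Y] \leq E[Y | \Lambda \in I_0]\left(\mathbb{P}(\Lambda \in I_0) + \mathbb{P}(\Lambda \in I_1)\right) = E[Y | \Lambda \in I_0],$$

which combined with (9) gives the result. □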

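Proof of Claim (i), $E_1[Y | P_0 \in I_0] \geq E_1[Y | P_0 \in I_1]$:

We will first bound $E_1[Y | P_0 \in I_1]$ above. Let
$\mu(n) = (q_0^{q_0} q_1^{q_1} q_2^{q_2} q_3^{q_3})^n$. Now, conditional on
$\mathbf{n}$ satisfying $F_c$ we have

$$n^{-1} \log\left(\mu(n) / Y(t_0, t_1)\right) = d_{KL}(q, p) + o(1),$$

where $p = (p_0, p_1, p_2, p_3)$ and $q = (q_0, q_1, q_2, q_3)$, and $d_{KL}$
denotes Kullback-Leibler distance. Now,
$d_{KL}(q, p) \geq \frac{1}{2}\|q - p\|_1^2 \geq \frac{1}{2}|q_0 - p_0|^2$ (the
first inequality is a standard one in probability theory). In particular, if
$p_0 \in I_1$, then $|q_0 - p_0| > s > 0$. Moreover,

$$E_1[Y | P_0 \in I_1] \leq \max\{Y(t_0, t_1) : p_0(t_0, t_1) \in I_1\}.$$

Summarizing,

$$E_1[Y | P_0 \in I_1] \leq \max\{Y(t_0, t_1) : p_0(t_0, t_1) \in I_1\} \leq \mu(n)\, e^{-\frac{1}{2}s^2 n + o(n)}. \tag{10}$$

In the reverse direction, we have:

$$E_1[Y | P_0 \in I_0] \geq A(n) B(n),$$

where

$$A(n) = \min\{Y(t_0, t_1) : (t_0, t_1) \in [0, n^{-1}] \times [t_1^0, t_1^0 + n^{-1}]\}$$

and

$$B(n) = \mathbb{P}\left((t_0, t_1) \in [0, n^{-1}] \times [t_1^0, t_1^0 + n^{-1}]\right).$$

Now,

$$\frac{A(n)}{\mu(n)} = \left[\left(\frac{p_0}{q_0}\right)^{q_0} \left(\frac{p_1}{q_1}\right)^{q_1} \left(\frac{p_2}{q_1}\right)^{2q_1}\right]^n \cdot \left(p_0^{\Delta_0}\, p_1^{\Delta_2 - \frac{1}{3}\Delta_0}\, p_2^{\Delta_1 + \Delta_3 - \frac{2}{3}\Delta_0}\right)^{\sqrt{n}},$$

with $p_0, p_1, p_2$ evaluated at a point $(t_0, t_1)$ at which the minimum
defining $A(n)$ is attained. Now, the first term of this product converges to a
constant as $n$ grows (because $p_0 - q_0$, $p_1 - q_1$ and $p_2 - q_1$ are
each of order $n^{-1}$) while the condition $F_c$ ensures that the second term
decays no faster than $e^{-C_1\sqrt{n}}$ for a constant $C_1$. Thus,
$A(n) \geq C_2\, \mu(n)\, e^{-C_1\sqrt{n}}$ for a positive constant $C_2$. The
term $B(n)$ is asymptotically proportional to $n^{-2}$. Summarizing, for a
constant $C_3 > 0$ (dependent just on $t_1^0$),

$$E_1[Y | P_0 \in I_0] \geq C_3\, \mu(n)\, n^{-2} e^{-C_1\sqrt{n}},$$

which combined with (10) establishes Claim (i) for $n$ sufficiently large. □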

In order to prove claim (ii) we need some preliminary results. 

Lemma 3.1 Let $\eta < 1$. Then for each $x > 0$ there exists a value
$K = K(x) < \infty$ that depends continuously on $x$ so that the following
holds. For any continuous random variable $Z$ on $[0, 1]$ with a smooth density
function $f$ that satisfies $f(1) \neq 0$ and $|f'(z)| < B$ for all
$z \in (\eta, 1]$, we have

$$\frac{k\left(E[Z^k] - E[Z^{k+1}]\right)}{E[Z^k]} \geq \frac{1}{2}$$

for all $k \geq K\left(\frac{B}{f(1)}\right)$.

Proof. Let $t_k = 1 - k^{-1/2}$. Then

$$E[Z^k] = \int_0^{t_k} t^k f(t)\,dt + \int_{t_k}^1 t^k f(t)\,dt.$$

Now,

$$0 \leq \int_0^{t_k} t^k f(t)\,dt \leq t_k^k \sim e^{-\sqrt{k} - 1/2},$$

where $\sim$ denotes asymptotic equivalence (i.e. $a(k) \sim b(k)$ iff
$\lim_{k \to \infty} a(k)/b(k) = 1$). Using integration by parts,

$$\int_{t_k}^1 t^k f(t)\,dt = \left[\frac{t^{k+1}}{k+1} f(t)\right]_{t_k}^1 - \frac{1}{k+1}\int_{t_k}^1 t^{k+1} f'(t)\,dt.$$

Now, provided $k > (1 - \eta)^{-2}$ we have $t_k > \eta$ and so the absolute
value of the second term on the right is at most
$\frac{B}{k+1}\int_{t_k}^1 t^{k+1}\,dt = \frac{B}{(k+1)(k+2)}\left(1 - t_k^{k+2}\right)$.
Consequently, $E[Z^k] - \frac{f(1)}{k+1}$ is bounded above by $B$ times a term
of order $k^{-2}$. A similar argument, again using integration by parts, shows
that $k\left(E[Z^k] - E[Z^{k+1}]\right) - \frac{f(1)}{k+1}$ is bounded above by
$B$ times a term of order $k^{-2}$, and the lemma now follows by some routine
analysis. □
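As a simple illustration of Lemma 3.1 (an example chosen here for its closed form, not one taken from the source), suppose $Z$ is uniform on $[0, 1]$, so that $f \equiv 1$, $f(1) = 1$ and $f' \equiv 0$. The moments can then be computed exactly:

```latex
\begin{align*}
E[Z^k] &= \int_0^1 t^k \, dt = \frac{1}{k+1}, \\
E[Z^k] - E[Z^{k+1}] &= \frac{1}{k+1} - \frac{1}{k+2} = \frac{1}{(k+1)(k+2)}, \\
\frac{k\left(E[Z^k] - E[Z^{k+1}]\right)}{E[Z^k]} &= \frac{k}{k+2} \ge \frac{1}{2}
  \quad \text{for all } k \ge 2 .
\end{align*}
```

In this special case the ratio tends to 1 as $k$ grows, so the bound of $\frac{1}{2}$ holds with room to spare.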

Lemma 3.2 Let $y = (1 + 2x)(1 - x)^2$. Then for $x \in [0, 1)$ and $m \geq 3$
we have

$$\left(\frac{1 + 2x}{1 - x}\right)^m \geq m^2 (1 - y).$$

Proof.

$$\left(\frac{1 + 2x}{1 - x}\right)^m = \left(1 + \frac{3x}{1 - x}\right)^m \geq \frac{m(m-1)}{2}\left(\frac{3x}{1 - x}\right)^2 \geq \frac{9m(m-1)x^2}{2},$$

and $m^2(1 - y) = m^2(3x^2 - 2x^3) \leq 3m^2x^2$, and for $m \geq 3$ this upper
bound is no greater than the lower bound in the previous expression. □
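The inequality of Lemma 3.2 is easy to probe numerically. The sketch below evaluates the gap between the two sides on a grid; the function name `lemma_3_2_gap` and the particular grid of `x` and `m` values are arbitrary choices made here for illustration.

```python
def lemma_3_2_gap(x, m):
    """Left side minus right side of Lemma 3.2; should be non-negative."""
    y = (1.0 + 2.0 * x) * (1.0 - x) ** 2
    lhs = ((1.0 + 2.0 * x) / (1.0 - x)) ** m
    return lhs - m * m * (1.0 - y)

# x ranges over [0, 0.999]; m includes non-integer values, since in the
# application m = 2c*sqrt(3k) need not be an integer.
worst_gap = min(lemma_3_2_gap(i / 1000.0, m)
                for i in range(1000)
                for m in (3.0, 3.5, 4.0, 10.0, 50.0))
assert worst_gap >= 0.0
```

A grid check of this kind is only a sanity test and does not replace the proof.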

Proof of Claim (ii), for all $p_0 \in I_0$,
$\frac{E_1[X | P_0 = p_0]}{E_1[Y | P_0 = p_0]} \geq 6c^2$:

Write $E_1[W | p_0]$ as shorthand for $E_1[W | P_0 = p_0]$. Note that, for any
$r, s > 0$,
$E_1[P_0^{n_0} P_1^r P_2^s | p_0] = p_0^{n_0} E_1[P_1^r P_2^s | p_0]$.
Consequently, if we let $k = k(n) = \frac{1}{3}(n - n_0)$ then, by definition
of the $\Delta_i$'s,

$$\frac{E_1[X | p_0]}{E_1[Y | p_0]} = \frac{E_1\left[(P_1 P_2^2)^k \cdot \left(P_1^{\Delta_1} P_2^{\Delta_2 + \Delta_3}\right)^{\sqrt{n}} \,\middle|\, p_0\right]}{E_1\left[(P_1 P_2^2)^k \cdot \left(P_1^{\Delta_2} P_2^{\Delta_1 + \Delta_3}\right)^{\sqrt{n}} \,\middle|\, p_0\right]}. \tag{11}$$

Now, conditional on $\mathbf{n}$ satisfying $F_c$ (and since $P_1 \geq P_2$)
the following two inequalities hold:

$$P_1^{\Delta_1} P_2^{\Delta_2 + \Delta_3} = \left(\frac{P_1}{P_2}\right)^{\Delta_1} \geq \left(\frac{P_1}{P_2}\right)^{2c} \quad\text{and}\quad P_1^{\Delta_2} P_2^{\Delta_1 + \Delta_3} = \left(\frac{P_1}{P_2}\right)^{\Delta_2} \leq 1.$$

Applying this to (11) gives:

$$\frac{E_1[X | p_0]}{E_1[Y | p_0]} \geq \frac{E_1\left[(P_1 P_2^2)^k \left(\frac{P_1}{P_2}\right)^{2c\sqrt{n}} \,\middle|\, p_0\right]}{E_1\left[(P_1 P_2^2)^k \,\middle|\, p_0\right]}. \tag{12}$$

Let $U = \frac{P_1 - P_2}{P_1 + 2P_2}$, which takes values between 0 and 1
because $P_1 \geq P_2$. Since $P_1 + 2P_2 = 1 - P_0$, we can write
$P_1 = \frac{1}{3}(1 + 2U)(1 - P_0)$ and $P_2 = \frac{1}{3}(1 - U)(1 - P_0)$.
Thus, $P_1 P_2^2 = \frac{1}{27}(1 + 2U)(1 - U)^2(1 - P_0)^3$ and
$\frac{P_1}{P_2} = \frac{1 + 2U}{1 - U}$. Substituting these into (12), letting
$Z = (1 + 2U)(1 - U)^2$ and noting that $\sqrt{n} \geq \sqrt{3k}$ gives

$$\frac{E_1[X | p_0]}{E_1[Y | p_0]} \geq \frac{E_1\left[Z^k \left(\frac{1 + 2U}{1 - U}\right)^{2c\sqrt{3k}} \,\middle|\, p_0\right]}{E_1[Z^k | p_0]}.$$

Thus, by Lemma 3.2 (taking $x = U$, $y = Z$, $m = 2c\sqrt{3k}$) we obtain, for
$m \geq 3$,

$$\frac{E_1[X | p_0]}{E_1[Y | p_0]} \geq 12c^2 k \cdot \frac{E_1[Z^k | p_0] - E_1[Z^{k+1} | p_0]}{E_1[Z^k | p_0]}. \tag{13}$$

Now the mapping $(t_0, t_1) \mapsto (P_0, Z)$ is a smooth invertible mapping
between $(0, \infty)^2$ and its image within $(\frac{1}{4}, 1) \times (0, 1)$.
Notice that $Z$ approaches 1 whenever $P_0$ approaches $\frac{1}{4}$ or 1 (in
particular, even if $t_0, t_1$ are independent, $P_0$ and $Z$ generally will
not be). However, over the interval $I_0$ the conditional density
$f(Z | P_0 = p_0)$ of $Z$ given a value $p_0$ for $P_0$ is smooth and bounded
away from 0, and its first derivative is also bounded above over this interval.
Consequently, we may apply Lemma 3.1 (noting that the condition that
$\mathbf{n}$ satisfies $F_c$ ensures that $k(n) \geq q_1 n - o(n)$) to show
that for $n$ sufficiently large the following inequality holds for all
$p_0 \in I_0$,

$$\frac{k\left(E_1[Z^k | p_0] - E_1[Z^{k+1} | p_0]\right)}{E_1[Z^k | p_0]} \geq \frac{1}{2}.$$

Applying this to (13) gives $\frac{E_1[X | p_0]}{E_1[Y | p_0]} \geq 6c^2$ as
claimed. This completes the proof of Claim (ii).